[NVIDIA] feat: MiniMax M3 Day 0 support B300 #1724

Merged
functionstackx merged 6 commits into main from feat/minimax-m3-b300 on Jun 12, 2026

@cquil11 cquil11 commented Jun 12, 2026

Collaborator

MiniMax-M3 MXFP8 day-zero single-node vLLM sweep on B300.

  • New config minimaxm3-fp8-b300-vllm (.github/configs/nvidia-master.yaml) — TP8/TP4/TEP/DEP plus a tp2-ep2 entry across 1k1k and 8k1k (40 jobs).
  • New bench script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300.sh: --block-size 128 (MSA sparse attention), --language-model-only, conc-scaled cudagraph capture, MXFP8 checkpoint; serves from the launch_b300-nv.sh MODEL/MODEL_PATH split (unstaged model -> writable /data/models).
  • Image: dedicated vllm/vllm-openai:minimax-m3 (already the cu130 build; M3 support unmerged upstream — [Model] Add MiniMax M3 support vllm-project/vllm#45381).
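
The layout names in the first bullet can be read as combinations of three knobs. A minimal sketch, assuming TEP means tensor parallel plus expert parallel (the review commits in this PR equate TP8+EP8 with TEP8) and DEP means expert parallel with data-parallel attention. The `Layout` class and its field names are hypothetical and are not taken from nvidia-master.yaml:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of one parallel layout in the sweep."""
    tp: int                 # tensor-parallel degree
    ep: int = 1             # expert-parallel degree (1 = no expert parallelism)
    dp_attn: bool = False   # data-parallel attention (assumed meaning of the D in DEP)

    def label(self) -> str:
        # Assumed naming: DEP takes precedence, then TEP, then plain TP.
        if self.dp_attn:
            return f"DEP{self.ep}"
        if self.ep > 1:
            return f"TEP{self.tp}"
        return f"TP{self.tp}"


layouts = [Layout(tp=8), Layout(tp=4), Layout(tp=8, ep=8), Layout(tp=2, ep=2)]
labels = [layout.label() for layout in layouts]
print(labels)
```

Under this naming, the tp2-ep2 entry falls into the TEP family, which matches the commit message's "TP/TEP/DEP, incl. tp2-ep2".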

Status: full sweep green (40/40 + 5/5 GSM8K, zero failures). Pareto: TP8 wins latency (~65 tok/s/user @ c4); TP4+EP4 wins 1k1k throughput (1909 tok/s/GPU @ c512); TP4 wins 8k1k (591 tok/s/GPU @ c128). Runs: canary, full.

🤖 Generated with Claude Code


Note

Low Risk
Additive benchmark config and shell script only; no changes to core inference, auth, or shared runtime beyond new CI sweep jobs.

Overview
Adds day-zero single-node throughput coverage for MiniMax-M3 (MiniMaxAI/MiniMax-M3-MXFP8) on B300 via a new minimaxm3-fp8-b300-vllm entry in nvidia-master.yaml, using the dedicated vllm/vllm-openai:minimax-m3 image and fixed-seq-len sweeps at 1k1k and 8k1k across TP/EP and data-parallel attention layouts.

Introduces benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300.sh, which downloads the unstaged checkpoint to the B300 MODEL/MODEL_PATH layout, extends engine readiness timeout for large MXFP8 loads, and launches vLLM with mandatory --block-size 128, --language-model-only, and concurrency-scaled CUDA graph capture before running the standard serving benchmark (optional eval).
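
One possible reading of "concurrency-scaled CUDA graph capture" is that graphs are captured only for batch sizes up to the concurrency being benchmarked, so low-concurrency jobs do not pay for captures they never use. The sketch below illustrates that idea only; it is not the logic in minimaxm3_fp8_b300.sh, and the function name, the 1/2/4 prefix, and the step of 8 are assumptions:

```python
def capture_sizes(conc: int, step: int = 8) -> list[int]:
    """Return batch sizes to capture, bounded by the benchmark concurrency.

    Hypothetical helper: small sizes 1/2/4, then multiples of `step`,
    and always the exact concurrency as the largest size.
    """
    if conc < 1:
        raise ValueError("conc must be >= 1")
    small = [s for s in (1, 2, 4) if s <= conc]
    large = list(range(step, conc + 1, step))
    sizes = small + large
    if sizes[-1] != conc:
        sizes.append(conc)  # make sure the benchmarked concurrency itself is captured
    return sizes


print(capture_sizes(4))    # [1, 2, 4]
print(capture_sizes(20))   # [1, 2, 4, 8, 16, 20]
```

The resulting list would then be handed to the engine's CUDA graph configuration; how the script passes it to vLLM is not shown in this PR description.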

Documents the change in perf-changelog.yaml.

Reviewed by Cursor Bugbot for commit aff01bc. Bugbot is set up for automated code reviews on this repo.

MXFP8 single-node vLLM sweep (TP/TEP/DEP, incl. tp2-ep2) for MiniMax-M3
on B300. --block-size 128 (MSA sparse attention), --language-model-only
for text-only throughput, dedicated vllm/vllm-openai:minimax-m3 image
(vllm-project/vllm#45381). Serves from the launch_b300-nv.sh MODEL_PATH
split (unstaged model -> writable /data/models).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

3 comment threads on .github/configs/nvidia-master.yaml (outdated)
@jasonlizhengjian

Collaborator

btw these comments also apply for B200 so please mirror there

Address PR #1724 review: TP8+EP8 conc-start 128->4 (1k1k and 8k1k) to
probe whether TEP8 extends the min-latency frontier below plain TP8;
TP4+EP4 conc-start 128->64 (1k1k) to fill the mid-curve.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cquil11 added a commit that referenced this pull request Jun 12, 2026
Mirror PR #1724 review changes to B200: TP8+EP8 conc-start 128->4
(1k1k and 8k1k) to probe whether TEP8 extends the min-latency frontier
below plain TP8; TP4+EP4 conc-start 128->64 (1k1k) to fill the mid-curve.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Lower conc-start 4->1 on the latency-probing layouts (tp8, tp8+ep8, tp4)
for both 1k1k and 8k1k to capture single/dual-request min-latency points.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
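
The effect of lowering conc-start can be sketched as follows, assuming the sweep doubles concurrency from conc-start up to an upper bound (consistent with the B200 mirror commit titled "add conc 1 and 2 to latency layouts"). The `conc_points` helper and the upper bound of 128 are hypothetical and only serve the illustration:

```python
def conc_points(conc_start: int, conc_end: int) -> list[int]:
    """Concurrency values visited by a doubling sweep (assumed sweep shape)."""
    if conc_start < 1 or conc_end < conc_start:
        raise ValueError("need 1 <= conc_start <= conc_end")
    points = []
    c = conc_start
    while c <= conc_end:
        points.append(c)
        c *= 2
    return points


before = conc_points(4, 128)   # sweep with conc-start 4
after = conc_points(1, 128)    # sweep with conc-start 1
added = [c for c in after if c not in before]
print(added)  # [1, 2]
```

Under this assumption, each affected layout gains two jobs per sequence-length setting: the single-request and dual-request points.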

@functionstackx

Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit bfd2371 into main Jun 12, 2026
5 of 6 checks passed
@functionstackx functionstackx deleted the feat/minimax-m3-b300 branch June 12, 2026 23:45

Oseltamivir added a commit that referenced this pull request Jun 13, 2026
Keep minimaxm3-fp8-b300-vllm single-node config from main (#1724)
alongside GB200/GB300 dynamo full sweep configs. Preserve GLM5
and M2.5-FP4-B300-TRT perf-changelog entries from main.
cquil11 added a commit that referenced this pull request Jun 13, 2026
* [NVIDIA] feat: MiniMax M3 Day 0 support B200

MXFP8 single-node vLLM sweep (TP/TEP/DEP) for MiniMax-M3 on B200.
--block-size 128 (MSA sparse attention), --language-model-only for
text-only throughput, dedicated vllm/vllm-openai:minimax-m3 image
(vllm-project/vllm#45381). Adds the b200-dgxc runner-type group and a
launch_b200-dgxc.sh MODEL_PATH case for the gharunner-staged weights.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* minimaxm3-fp8-b200-vllm: add perf-changelog entry

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* minimaxm3-fp8-b200-vllm: extend TEP8 to low conc for latency frontier

Mirror PR #1724 review changes to B200: TP8+EP8 conc-start 128->4
(1k1k and 8k1k) to probe whether TEP8 extends the min-latency frontier
below plain TP8; TP4+EP4 conc-start 128->64 (1k1k) to fill the mid-curve.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* minimaxm3-fp8-b200-vllm: add conc 1 and 2 to latency layouts

Lower conc-start 4->1 on the latency-probing layouts (tp8, tp8+ep8, tp4)
for both 1k1k and 8k1k to capture single/dual-request min-latency points.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>